
clientv3: fix balancer/retry #8710

Closed · wants to merge 17 commits

Conversation

@gyuho (Contributor) commented Oct 19, 2017

Fixing/improving the balancer and retry logic.

TODO: Investigate CI failures.

Cherry-pick 22c3f92f5faea8db492fb0f5ae4daf0d2752b19e.

1. Handle stale endpoints in the health balancer.
2. Rename 'unhealthy' to 'unhealthyHosts' to make its meaning clear.
3. Use the quoting format verb (%q) so that empty hosts still show up in logs (see the sketch below).
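A minimal, hypothetical illustration of point 3 (not code from this PR): Go's %q verb prints the host in quotes, so an empty host stays visible in the log instead of silently vanishing.

package main

import "log"

func main() {
	host := "" // an endpoint whose host part is empty
	// %q quotes the value, so an empty host is printed as ""
	log.Printf("clientv3/health-balancer: %q becomes unhealthy", host)
	// %s would print nothing for the host, making the message ambiguous
	log.Printf("clientv3/health-balancer: %s becomes unhealthy", host)
}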

…erClose

…GetResetLoneEndpoint

Since we have added an additional wait/sync when the error is
"there is no available address".

@gyuho gyuho added the WIP label Oct 19, 2017
Integration tests use a 5-second dial timeout, and it's possible for the
balancer retry logic to wait the full 5 seconds and for the test to time out,
because gRPC calls the grpc.downErr function after the connection wait starts
in the retry path. Just increasing the test timeout should be OK; in most
cases, grpc.downErr gets called before the wait starts.

e.g.

=== RUN   TestWatchErrConnClosed
INFO: 2017/10/18 23:55:39 clientv3/balancer: pin "localhost:91847156765553894590"
INFO: 2017/10/18 23:55:39 clientv3/retry: wait 5s for healthy endpoint
INFO: 2017/10/18 23:55:39 clientv3/balancer: unpin "localhost:91847156765553894590" ("grpc: the client connection is closing")
INFO: 2017/10/18 23:55:39 clientv3/health-balancer: "localhost:91847156765553894590" becomes unhealthy ("grpc: the client connection is closing")
--- FAIL: TestWatchErrConnClosed (3.07s)
        watch_test.go:682: wc.Watch took too long
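A self-contained sketch of the timing issue described above (names and durations are illustrative, not the actual test code): if the call under test may legitimately block for the full 5-second dial timeout while the balancer waits for a healthy endpoint, the test's watchdog budget has to exceed that, otherwise the test fails spuriously as in the log.

package main

import (
	"fmt"
	"time"
)

func main() {
	const dialTimeout = 5 * time.Second

	donec := make(chan struct{})
	go func() {
		defer close(donec)
		// stands in for the balancer/retry path waiting up to the dial timeout
		time.Sleep(dialTimeout)
	}()

	select {
	case <-time.After(dialTimeout + 5*time.Second): // budget > dial timeout, as the commit suggests
		fmt.Println("took too long")
	case <-donec:
		fmt.Println("finished before the watchdog fired")
	}
}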

@@ -152,7 +152,9 @@ func (b *simpleBalancer) pinned() string {
 	return b.pinAddr
 }
 
-func (b *simpleBalancer) endpointError(addr string, err error) { return }
+func (b *simpleBalancer) endpointError(host string, err error) {
+	panic("'endpointError' not implemented")

why panic here?
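One possible answer, sketched here as an assumption rather than the PR's actual intent: if simpleBalancer deliberately does not track per-endpoint errors (only the health balancer does), the stub could be a documented no-op instead of a panic.

package balancersketch

// simpleBalancer is a stand-in type for this sketch only.
type simpleBalancer struct{}

// endpointError is intentionally a no-op here: this balancer does not track
// per-endpoint health, and a panic would crash any caller that reaches it.
func (b *simpleBalancer) endpointError(host string, err error) {}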

	return func(rpcCtx context.Context, f rpcFunc) error {
		for {
			pinned := c.balancer.pinned()
			err := f(rpcCtx)
			if err == nil {
				return nil
			}
			if logger.V(4) {

we should keep it here

		// always stop retry on etcd errors other than invalid auth token
		if rpctypes.Error(err) == rpctypes.ErrInvalidAuthToken {
			gterr := c.getToken(rpcCtx)
			if gterr != nil {
				if logger.V(4) {

Log the error here, and also log that we cannot retry because of it.
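A hedged sketch of what that suggestion could look like, with illustrative names rather than the PR's actual helpers: when the token refresh fails, log both the original error and the refresh error, make it explicit that retrying stops, and return.

package retrysketch

import (
	"errors"
	"log"
)

var errInvalidAuthToken = errors.New("invalid auth token") // illustrative stand-in

// retryWithTokenRefresh retries op only while the failure is an invalid auth
// token that a refresh might fix; otherwise it logs why retrying stops.
func retryWithTokenRefresh(op func() error, refreshToken func() error) error {
	for {
		err := op()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errInvalidAuthToken) {
			return err // always stop on other errors
		}
		if rerr := refreshToken(); rerr != nil {
			log.Printf("clientv3/retry: cannot retry %v due to token refresh error %v", err, rerr)
			return err
		}
	}
}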

@@ -48,8 +48,8 @@ type healthBalancer struct {
 	// eps stores all client endpoints
 	eps []string
 
-	// unhealthy tracks the last unhealthy time of endpoints.
-	unhealthy map[string]time.Time
+	// unhealthyHosts tracks the last unhealthy time of endpoints.
@xiang90 Oct 19, 2017

unhealthy hosts or endpoints?

		if _, ok := hb.host2ep[k]; !ok {
			delete(hb.unhealthyHosts, k)
			if logger.V(4) {
				logger.Infof("clientv3/health-balancer: removes stale endpoint %q from unhealthy", k)

from unhealthyEndpoints?

@@ -473,8 +473,8 @@ func TestKVNewAfterClose(t *testing.T) {
 
 	donec := make(chan struct{})
 	go func() {
-		if _, err := cli.Get(context.TODO(), "foo"); err != context.Canceled {
-			t.Fatalf("expected %v, got %v", context.Canceled, err)
+		if _, err := cli.Get(context.TODO(), "foo"); err != context.Canceled && err != grpc.ErrClientConnClosing {

An RPC should only return a status-type error. Why would grpc.ErrClientConnClosing be returned?
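A sketch of the direction this comment points at, assuming the closing error is surfaced as a gRPC status (which depends on the gRPC version in use): inspect the status code with the status package instead of comparing against the grpc.ErrClientConnClosing sentinel.

package statussketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// acceptableCloseError reports whether err is an expected result of calling
// into a client that has just been closed: either context.Canceled or a
// Canceled gRPC status.
func acceptableCloseError(err error) bool {
	if err == context.Canceled {
		return true
	}
	if s, ok := status.FromError(err); ok {
		return s.Code() == codes.Canceled
	}
	return false
}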

@@ -1083,8 +1086,8 @@ func TestLeasingOwnerPutError(t *testing.T) {
 	clus.Members[0].Stop(t)
 	ctx, cancel := context.WithTimeout(context.TODO(), 100*time.Millisecond)
 	defer cancel()
-	if resp, err := lkv.Put(ctx, "k", "v"); err == nil {
-		t.Fatalf("expected error, got response %+v", resp)
+	if resp, err := lkv.Put(ctx, "k", "v"); err != context.DeadlineExceeded && !strings.Contains(err.Error(), "transport is closing") {

An RPC call should only return a context-type or status-type error.

@@ -881,7 +881,7 @@ func TestKVGetResetLoneEndpoint(t *testing.T) {
 	// have Get try to reconnect
 	donec := make(chan struct{})
 	go func() {
-		ctx, cancel := context.WithTimeout(context.TODO(), 5*time.Second)
+		ctx, cancel := context.WithTimeout(context.TODO(), 8*time.Second)

Why does it take 8 seconds to reconnect?


const minDialDuration = 3 * time.Second

func (c *Client) newRetryWrapper(write bool) retryRPCFunc {

This has become far more complicated than what we discussed.

Retry should only care about retrying; we should put the initial wait-for-connection logic in a separate function.
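A rough sketch of the separation being asked for, with purely illustrative names and signatures (not the shape of the eventual fix): one function waits for an initial pinned connection, and the retry wrapper only decides whether to retry.

package retrysplit

import (
	"context"
	"time"
)

// waitForConnection blocks until the balancer reports a pinned address or the
// context is done. Polling is only for illustration; a real implementation
// would wait on a notification channel.
func waitForConnection(ctx context.Context, pinned func() string) error {
	for pinned() == "" {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
	return nil
}

// retry assumes a connection already exists and only cares about retrying.
func retry(ctx context.Context, op func(context.Context) error, retriable func(error) bool) error {
	for {
		if err := op(ctx); err == nil || !retriable(err) {
			return err
		}
		if err := ctx.Err(); err != nil {
			return err
		}
	}
}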

@@ -678,7 +678,7 @@ func TestWatchErrConnClosed(t *testing.T) {
 	clus.TakeClient(0)
 
 	select {
-	case <-time.After(3 * time.Second):
+	case <-time.After(10 * time.Second):

If the dial timeout is 5 seconds, shouldn't 6 seconds be enough? Also, can we simply reduce the dial timeout?

@xiang90 (Contributor) commented Oct 20, 2017

We will send new PRs to replace this one; it is too messy. Closing.

@xiang90 xiang90 closed this Oct 20, 2017
@gyuho gyuho deleted the fix-balancer-retry branch November 11, 2017 18:33